Hybrid Collaborative Filtering with Autoencoders
Collaborative Filtering aims at exploiting the feedback of users to provide
personalised recommendations. Such algorithms look for latent variables in a
large sparse matrix of ratings. They can be enhanced by adding side information
to tackle the well-known cold start problem. While Neural Networks have
had tremendous success in image and speech recognition, they have received less
attention in Collaborative Filtering. This is all the more surprising given that
Neural Networks are able to discover latent variables in large and
heterogeneous datasets. In this paper, we introduce a Collaborative Filtering
Neural network architecture, dubbed CFN, which computes a non-linear Matrix
Factorization from sparse rating inputs and side information. We show
experimentally on the MovieLens and Douban datasets that CFN outperforms the
state of the art and benefits from side information. We provide an
implementation of the algorithm as a reusable plugin for Torch, a popular
Neural Network framework.
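The core of a CFN-style model is an autoencoder that reconstructs a user's sparse rating vector, learning a non-linear factorization along the way. Below is a minimal NumPy sketch of the forward pass and the masked reconstruction loss, with toy sizes, random weights, and hypothetical names; it is not the paper's Torch implementation. Side information could, under the same scheme, be appended to the input vector.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_hidden = 6, 3  # toy sizes, not the paper's
W1 = rng.normal(scale=0.1, size=(n_hidden, n_items))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_items, n_hidden))
b2 = np.zeros(n_items)

def autoencode(ratings, mask):
    """One forward pass: encode the observed ratings, decode all items."""
    h = np.tanh(W1 @ (ratings * mask) + b1)   # latent user representation
    return W2 @ h + b2                        # predictions, incl. unrated items

ratings = np.array([4., 0., 5., 0., 1., 0.])  # 0 = unrated
mask = (ratings > 0).astype(float)
pred = autoencode(ratings, mask)
# training would minimize the squared error on observed entries only
loss = np.sum(mask * (pred - ratings) ** 2)
```

The mask is the key detail: gradients flow only through rated entries, while the decoder still produces scores for unrated items, which is what the recommender actually needs.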
Learning Visual Reasoning Without Strong Priors
Achieving artificial visual reasoning - the ability to answer image-related
questions which require a multi-step, high-level process - is an important step
towards artificial general intelligence. This multi-modal task requires
learning a question-dependent, structured reasoning process over images from
language. Standard deep learning approaches tend to exploit biases in the data
rather than learn this underlying structure, while leading methods learn to
visually reason successfully but are hand-crafted for reasoning. We show that a
general-purpose, Conditional Batch Normalization approach achieves
state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4%
error rate. We outperform the next best end-to-end method (4.5%) and even
methods that use extra supervision (3.1%). We probe our model to shed light on
how it reasons, showing it has learned a question-dependent, multi-step
process. Previous work has operated under the assumption that visual reasoning
calls for a specialized architecture, but we show that a general architecture
with proper conditioning can learn to visually reason effectively.
Comment: Full AAAI 2018 paper is at arXiv:1709.07871. Presented at ICML 2017's
Machine Learning in Speech and Language Processing Workshop. Code is at
http://github.com/ethanjperez/fil
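Conditional Batch Normalization, as used here, normalizes each feature channel and then applies a scale and shift predicted from the question embedding. A toy NumPy sketch of one such step; shapes, weight matrices, and the linear gamma/beta predictors are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def conditional_batch_norm(x, q_embed, Wg, Wb, eps=1e-5):
    """Normalize each channel over the batch, then modulate with a
    question-conditioned scale and shift (a CBN-style step)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    gamma = 1.0 + q_embed @ Wg   # predict deltas around the identity scale
    beta = q_embed @ Wb
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))            # batch of 8, 4 feature channels
q = rng.normal(size=3)                 # toy question embedding
Wg = rng.normal(scale=0.1, size=(3, 4))
Wb = rng.normal(scale=0.1, size=(3, 4))
out = conditional_batch_norm(x, q, Wg, Wb)
```

Because the normalized features have zero mean per channel, the question embedding alone decides where each channel's activations are re-centered, which is how language gets to steer visual computation.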
FiLM: Visual Reasoning with a General Conditioning Layer
We introduce a general-purpose conditioning method for neural networks called
FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network
computation via a simple, feature-wise affine transformation based on
conditioning information. We show that FiLM layers are highly effective for
visual reasoning - answering image-related questions which require a
multi-step, high-level process - a task which has proven difficult for standard
deep learning methods that do not explicitly model reasoning. Specifically, we
show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error
for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are
robust to ablations and architectural modifications, and 4) generalize well to
challenging, new data from few examples or even zero-shot.
Comment: AAAI 2018. Code available at http://github.com/ethanjperez/film.
Extends arXiv:1707.0301
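A FiLM layer itself is just a feature-wise affine transformation of a feature map. A minimal NumPy sketch; in the paper, `gamma` and `beta` are produced by a network conditioned on the question, whereas here they are random placeholders:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature map
    by its conditioning-derived coefficients (broadcast over H and W)."""
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 5))   # (channels, H, W) conv features
# In FiLM these come from a conditioning network; random stand-ins here.
gamma = rng.normal(size=4)
beta = rng.normal(size=4)
out = film(features, gamma, beta)
```

One pair of scalars per channel is all the conditioning signal a FiLM layer injects, which is part of why the method is cheap and architecture-agnostic.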
HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language
Workshop
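OpenAI Gym compatibility means agents interact with HoME through the standard `reset`/`step` contract. The toy stand-in environment below illustrates that loop; it is not HoME's actual API, just the interface shape a Gym-compatible agent programs against:

```python
class ToyGridEnv:
    """Minimal Gym-style environment: reset() returns an observation,
    step(action) returns (observation, reward, done, info)."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action: +1 or -1 along a line
        self.pos += action
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}

env = ToyGridEnv()
obs, done, total = env.reset(), False, 0.0
while not done:                      # a trivial always-go-right policy
    obs, reward, done, info = env.step(+1)
    total += reward
```

Any learning algorithm written against this interface can, in principle, be pointed at a richer environment such as HoME without code changes.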
End-to-end optimization of goal-driven and visually grounded dialogue systems
End-to-end design of dialogue systems has recently become a popular research
topic thanks to powerful tools such as encoder-decoder architectures for
sequence-to-sequence learning. Yet, most current approaches cast human-machine
dialogue management as a supervised learning problem, aiming at predicting the
next utterance of a participant given the full history of the dialogue. This
view is too simplistic to capture the planning problem intrinsic to
dialogue, as well as its grounded nature, which makes the context of a dialogue
larger than its history alone. This is why only chit-chat and question answering
tasks have been addressed so far using end-to-end architectures. In this paper,
we introduce a Deep Reinforcement Learning method to optimize visually grounded
task-oriented dialogues, based on the policy gradient algorithm. This approach
is tested on a dataset of 120k dialogues collected through Mechanical Turk and
provides encouraging results at solving both the problem of generating natural
dialogues and the task of discovering a specific object in a complex picture.
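The policy gradient (REINFORCE) principle underlying the method can be illustrated on a toy bandit-style task: sample an action from a softmax policy and nudge its log-probability in proportion to the reward received. A hedged NumPy sketch, unrelated to the paper's actual dialogue models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)               # logits of a softmax policy
true_rewards = np.array([0.1, 0.9, 0.2])  # toy task: action 1 is best

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(n_actions, p=probs)    # sample an action
    r = true_rewards[a]                   # observe its reward
    grad_log = -probs                     # grad of log pi(a | theta) ...
    grad_log[a] += 1.0                    # ... w.r.t. the logits
    theta += lr * r * grad_log            # REINFORCE update

best = int(np.argmax(softmax(theta)))     # should settle on action 1
```

In the paper the "action" is an utterance generated by the dialogue policy and the reward arrives only at the end of the game, but the update rule has this same sample-and-reinforce form.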
GuessWhat?! Visual object discovery through multi-modal dialogue
We introduce GuessWhat?!, a two-player guessing game as a testbed for
research on the interplay of computer vision and dialogue systems. The goal of
the game is to locate an unknown object in a rich image scene by asking a
sequence of questions. Higher-level image understanding, like spatial reasoning
and language grounding, is required to solve the proposed task. Our key
contribution is the collection of a large-scale dataset consisting of 150K
human-played games with a total of 800K visual question-answer pairs on 66K
images. We explain our design decisions in collecting the dataset and introduce
the oracle and questioner tasks that are associated with the two players of the
game. We prototyped deep learning models to establish initial baselines of the
introduced tasks.
Comment: 23 pages; CVPR 2017 submission; see https://guesswhat.a
A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning
Speaker recognition is a well known and studied task in the speech processing
domain. It has many applications, either for security or speaker adaptation of
personal devices. In this paper, we present a new paradigm for automatic
speaker recognition that we call Interactive Speaker Recognition (ISR). In this
paradigm, the recognition system aims to incrementally build a representation
of the speakers by requesting personalized utterances to be spoken in contrast
to the standard text-dependent or text-independent schemes. To do so, we cast
the speaker recognition task into a sequential decision-making problem that we
solve with Reinforcement Learning. Using a standard dataset, we show that our
method achieves excellent performance while requiring only small amounts of
speech signal.
This method could also be applied as an utterance selection mechanism for
building speech synthesis systems.
HIGhER : Improving instruction following with Hindsight Generation for Experience Replay
Language creates a compact representation of the world and allows the
description of unlimited situations and objectives through compositionality.
While these characterizations may foster instructing, conditioning or
structuring interactive agent behavior, it remains an open problem to correctly
relate language understanding and reinforcement learning in even simple
instruction following scenarios. This joint learning problem is alleviated
through expert demonstrations, auxiliary losses, or neural inductive biases. In
this paper, we propose an orthogonal approach called Hindsight Generation for
Experience Replay (HIGhER) that extends the Hindsight Experience Replay (HER)
approach to the language-conditioned policy setting. Whenever the agent does
not fulfill its instruction, HIGhER learns to output a new directive that
matches the agent trajectory, and it relabels the episode with a positive
reward. To do so, HIGhER learns to map a state into an instruction by using
past successful trajectories, which removes the need to have external expert
interventions to relabel episodes as in vanilla HER. We show the efficiency of
our approach in the BabyAI environment, and demonstrate how it complements
other instruction following methods.
Comment: Accepted at ADPRL'2
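The relabeling step at the heart of HIGhER can be sketched in a few lines: when an episode misses its instruction, map the reached state back to an instruction that does match, and store the episode with a positive reward. In the sketch below, the `mapper` is a stub standing in for the model HIGhER learns from past successful trajectories; all names are illustrative:

```python
def higher_relabel(episode, instruction, goal_of, mapper):
    """HER-style relabeling in the spirit of HIGhER: if the agent missed
    its instruction, substitute one that matches the final state and
    mark the episode as a success."""
    final_state = episode[-1]
    if goal_of(instruction) == final_state:
        return instruction, 1.0            # genuine success, keep as-is
    new_instruction = mapper(final_state)  # learned state -> instruction
    return new_instruction, 1.0            # relabeled, positive reward

# toy setup: states are colors, instructions are "go to <color>"
goal_of = lambda instr: instr.split()[-1]
mapper = lambda state: f"go to {state}"    # stand-in for the learned model

episode = ["start", "blue"]
instr, reward = higher_relabel(episode, "go to red", goal_of, mapper)
```

Here the agent was told "go to red" but ended on "blue", so the episode is stored as a successful demonstration of "go to blue": failures become training signal without any external expert relabeling.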